45 research outputs found

    Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data

    Background: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. Selecting the genes that are important for distinguishing the sample classes being compared poses a challenging problem in high-dimensional data analysis. We describe a new procedure for selecting significant genes, recursive cluster elimination (RCE), as an alternative to recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE.

    Results: We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove the clusters of genes that contribute the least to classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Using gene clusters rather than individual genes improves supervised classification accuracy on the same data compared with SVM or Penalized Discriminant Analysis (PDA) combined with recursive feature elimination (SVM-RFE and PDA-RFE), which remove genes based on their individual discriminant weights.

    Conclusion: SVM-RCE provides better classification accuracy on complex microarray datasets than either SVM-RFE or PDA-RFE applied to the same datasets. SVM-RCE identifies clusters of correlated genes that, when considered together, provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups.

    Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.
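
The elimination loop described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: to stay dependency-free it assumes the gene clusters are already given (the abstract uses K-means for that step) and it scores each cluster with leave-one-out nearest-centroid accuracy as a stand-in for the SVM. All function names and the toy data are hypothetical.

```python
from statistics import mean


def loo_centroid_accuracy(samples, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier using only `genes`.

    Stand-in for the SVM scoring step; not the classifier used in the paper.
    """
    correct = 0
    for held_out, (x, y) in enumerate(zip(samples, labels)):
        centroids = {}
        for cls in set(labels):
            rows = [s for j, s in enumerate(samples)
                    if labels[j] == cls and j != held_out]
            if rows:
                centroids[cls] = [mean(r[g] for r in rows) for g in genes]
        # Predict the class whose centroid is closest (squared Euclidean distance).
        predicted = min(
            centroids,
            key=lambda c: sum((x[g] - m) ** 2 for g, m in zip(genes, centroids[c])),
        )
        correct += predicted == y
    return correct / len(samples)


def recursive_cluster_elimination(samples, labels, clusters, keep=1, drop_fraction=0.5):
    """Repeatedly drop the lowest-scoring gene clusters until `keep` remain."""
    clusters = dict(clusters)
    while len(clusters) > keep:
        scores = {cid: loo_centroid_accuracy(samples, labels, genes)
                  for cid, genes in clusters.items()}
        n_drop = max(1, int(len(clusters) * drop_fraction))
        n_drop = min(n_drop, len(clusters) - keep)
        for cid in sorted(scores, key=scores.get)[:n_drop]:
            del clusters[cid]
    return clusters


# Toy data (invented): genes 0-1 separate the two classes, genes 2-3 do not.
samples = [
    [1.0, 1.2, 5.0, 3.0], [0.9, 1.1, 3.0, 5.0], [1.1, 0.8, 4.0, 4.0],
    [3.0, 3.1, 4.0, 3.0], [2.9, 3.2, 5.0, 4.0], [3.1, 2.8, 3.0, 5.0],
]
labels = [0, 0, 0, 1, 1, 1]
clusters = {"A": [0, 1], "B": [2, 3]}  # assumed output of a prior clustering step

surviving = recursive_cluster_elimination(samples, labels, clusters, keep=1)
print(surviving)
```

On this toy input the uninformative cluster B is removed and cluster A survives. Scoring whole clusters rather than single genes is the point of the design: correlated genes are kept or discarded together.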

    Learning from positive examples when the negative class is undetermined- microRNA gene identification

    Background: The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We and others have described the use of two-class machine learning to identify novel miRNAs. These methods require the generation of an artificial negative class. However, designating the negative class can be problematic; if it is not done properly, it can dramatically affect the performance of the classifier and/or yield a biased estimate of performance. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. These results are compared to published two-class miRNA prediction approaches. We also examine the ability of the one-class and two-class techniques to identify miRNAs in newly sequenced species.

    Results: Of all methods tested, we found that two-class naïve Bayes and Support Vector Machines gave the best accuracy using our selected features and optimally chosen negative examples. One-class methods showed average accuracies of 70–80% versus 90% for the two two-class methods on the same feature sets. However, some one-class methods outperform some recently published two-class approaches with different selected features. Using the EBV genome as an external validation of the method, we found one-class machine learning to work as well as or better than a two-class approach in identifying true miRNAs as well as predicting new miRNAs.

    Conclusion: One-class and two-class methods can both give useful classification accuracies when the negative class is well characterized. The advantage of one-class methods is that they eliminate guessing at the optimal features for the negative class when these are not well defined. In such cases one-class methods can be superior to two-class methods, provided the features chosen as representative of the positive class are well defined.

    Availability: The OneClassmiRNA program is available at: [1]
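
To make the one-class idea concrete, here is a minimal sketch that learns from positive examples only. It fits an independent Gaussian to each feature of the positives and accepts a candidate when its log-likelihood is at least as high as that of the weakest accepted training example. This is an illustrative density-threshold model, not one of the classifiers evaluated in the study; the two features (a length and a GC fraction), the data values, and all function names are hypothetical.

```python
import math
from statistics import mean, stdev


def fit_one_class(positives):
    """Fit a per-feature Gaussian (mean, standard deviation) to positives only."""
    columns = list(zip(*positives))
    return [(mean(col), stdev(col)) for col in columns]


def log_likelihood(model, x):
    """Log-density of x under independent per-feature Gaussians."""
    return sum(
        -0.5 * ((v - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))
        for v, (mu, sd) in zip(x, model)
    )


def choose_threshold(model, positives, reject_fraction=0.1):
    """Pick the score below which roughly `reject_fraction` of positives fall."""
    scores = sorted(log_likelihood(model, x) for x in positives)
    return scores[int(len(scores) * reject_fraction)]


def is_positive(model, threshold, x):
    """Accept x if it looks at least as typical as the threshold example."""
    return log_likelihood(model, x) >= threshold


# Invented positive examples: (length, GC fraction). No negative class is needed.
known_positives = [(22.0, 0.50), (21.0, 0.48), (23.0, 0.52), (22.5, 0.49), (21.5, 0.51)]
model = fit_one_class(known_positives)
threshold = choose_threshold(model, known_positives)

print(is_positive(model, threshold, (22.0, 0.50)))  # typical candidate
print(is_positive(model, threshold, (40.0, 0.90)))  # atypical candidate
```

The trade-off mirrors the abstract: no artificial negative class has to be constructed, but the decision boundary depends entirely on how well the chosen features describe the positive class.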

    Reproducible big data science: A case study in continuous FAIRness.

    Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.
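
One simple way to picture "assigning identifiers to data" is a content-derived identifier: the same bytes always map to the same identifier, so a later reader can verify that a dataset is unchanged. The sketch below is only an illustration of that idea using SHA-256 checksums from the Python standard library. It is not the tooling described in the abstract, and the function names, file names, and the `sha256:` prefix are assumptions.

```python
import hashlib


def content_identifier(data: bytes) -> str:
    """Derive a reproducible identifier from the bytes themselves."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_manifest(files: dict) -> dict:
    """Map each (hypothetical) file name to the identifier of its content."""
    return {name: content_identifier(blob) for name, blob in sorted(files.items())}


def verify(files: dict, manifest: dict) -> bool:
    """Check that the current contents still match a previously recorded manifest."""
    return build_manifest(files) == manifest


# Invented example inputs standing in for analysis data and code.
files = {"peaks.bed": b"chr1\t100\t120\n", "run.sh": b"echo analysis\n"}
manifest = build_manifest(files)

tampered = dict(files, **{"peaks.bed": b"chr1\t100\t121\n"})
print(verify(files, manifest))
print(verify(tampered, manifest))
```

Recording such a manifest alongside each analysis step lets a replication attempt confirm that it starts from exactly the same inputs.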

    Atlas of Transcription Factor Binding Sites from ENCODE DNase Hypersensitivity Data across 27 Tissue Types.

    Characterizing the tissue-specific binding sites of transcription factors (TFs) is essential to reconstruct gene regulatory networks and predict functions for non-coding genetic variation. DNase-seq footprinting enables the prediction of genome-wide binding sites for hundreds of TFs simultaneously. Despite the public availability of high-quality DNase-seq data from hundreds of samples, a comprehensive, up-to-date resource for the locations of genomic footprints is lacking. Here, we develop a scalable footprinting workflow using two state-of-the-art algorithms: Wellington and HINT. We apply our workflow to detect footprints in 192 ENCODE DNase-seq experiments and predict the genomic occupancy of 1,515 human TFs in 27 human tissues. We validate that these footprints overlap true-positive TF binding sites from ChIP-seq. We demonstrate that the locations, depth, and tissue specificity of footprints predict effects of genetic variants on gene expression and capture a substantial proportion of genetic risk for complex traits.
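
The validation step mentioned above, checking that predicted footprints overlap ChIP-seq binding sites, comes down to interval overlap. The following sketch computes the fraction of footprints that overlap at least one peak. It assumes half-open `(chrom, start, end)` intervals and uses a simple quadratic scan; the coordinates are invented and the function names are hypothetical. At genome scale one would normally use an indexed approach or a tool such as `bedtools intersect`.

```python
def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_overlapping(footprints, peaks):
    """Fraction of footprints that overlap at least one peak."""
    hits = sum(any(overlaps(f, p) for p in peaks) for f in footprints)
    return hits / len(footprints)


# Invented coordinates for illustration only.
footprints = [("chr1", 100, 120), ("chr1", 500, 515), ("chr2", 100, 120)]
chip_peaks = [("chr1", 90, 300), ("chr2", 400, 600)]

print(fraction_overlapping(footprints, chip_peaks))
```

Here only the first footprint falls inside a peak, so one third of the footprints are supported.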

    Development of Bioinformatics Infrastructure for Genomics Research:

    Although pockets of bioinformatics excellence have developed in Africa, generally, large-scale genomic data analysis has been limited by the availability of expertise and infrastructure. H3ABioNet, a pan-African bioinformatics network, was established to build capacity specifically to enable H3Africa (Human Heredity and Health in Africa) researchers to analyze their data in Africa. Since the inception of the H3Africa initiative, H3ABioNet's role has evolved in response to changing needs from the consortium and the African bioinformatics community.

    The trans-ancestral genomic architecture of glycemic traits

    Glycemic traits are used to diagnose and monitor type 2 diabetes and cardiometabolic health. To date, most genetic studies of glycemic traits have focused on individuals of European ancestry. Here we aggregated genome-wide association studies comprising up to 281,416 individuals without diabetes (30% non-European ancestry) for whom fasting glucose, 2-h glucose after an oral glucose challenge, glycated hemoglobin and fasting insulin data were available. Trans-ancestry and single-ancestry meta-analyses identified 242 loci (99 novel; P < 5 × 10^-8), 80% of which had no significant evidence of between-ancestry heterogeneity. Analyses restricted to individuals of European ancestry with equivalent sample size would have led to 24 fewer new loci. Compared with single-ancestry analyses, equivalent-sized trans-ancestry fine-mapping reduced the number of estimated variants in 99% credible sets by a median of 37.5%. Genomic-feature, gene-expression and gene-set analyses revealed distinct biological signatures for each trait, highlighting different underlying biological pathways. Our results increase our understanding of diabetes pathophysiology by using trans-ancestry studies for improved power and resolution. A trans-ancestry meta-analysis of GWAS of glycemic traits in up to 281,416 individuals identifies 99 novel loci, of which one quarter was found due to the multi-ancestry approach, which also improves fine-mapping of credible variant sets.
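
For readers unfamiliar with the term, a 99% credible set is the smallest group of variants whose posterior probabilities of being causal add up to at least 0.99. The sketch below shows the standard greedy construction and why sharper posteriors yield smaller sets. It assumes per-variant posterior probabilities have already been computed and sum to 1; the variant names and probability values are invented for illustration and are not taken from the study.

```python
def credible_set(posteriors, coverage=0.99):
    """Smallest set of variants whose cumulative posterior reaches `coverage`."""
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for variant, prob in ranked:
        chosen.append(variant)
        total += prob
        if total >= coverage:
            break
    return chosen


# Invented posteriors for one locus under two hypothetical analyses.
single_ancestry = {"v1": 0.40, "v2": 0.30, "v3": 0.20, "v4": 0.095, "v5": 0.005}
trans_ancestry = {"v1": 0.90, "v2": 0.095, "v3": 0.003, "v4": 0.001, "v5": 0.001}

print(len(credible_set(single_ancestry)))
print(credible_set(trans_ancestry))
```

In this toy case the more concentrated posterior halves the credible set, which is the kind of reduction the abstract reports as a median across loci.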

    Spatial, temporal, and demographic patterns in prevalence of smoking tobacco use and attributable disease burden in 204 countries and territories, 1990-2019 : a systematic analysis from the Global Burden of Disease Study 2019

    Background: Ending the global tobacco epidemic is a defining challenge in global health. Timely and comprehensive estimates of the prevalence of smoking tobacco use and attributable disease burden are needed to guide tobacco control efforts nationally and globally.

    Methods: We estimated the prevalence of smoking tobacco use and attributable disease burden for 204 countries and territories, by age and sex, from 1990 to 2019 as part of the Global Burden of Diseases, Injuries, and Risk Factors Study. We modelled multiple smoking-related indicators from 3625 nationally representative surveys. We completed systematic reviews and did Bayesian meta-regressions for 36 causally linked health outcomes to estimate non-linear dose-response risk curves for current and former smokers. We used a direct estimation approach to estimate attributable burden, providing more comprehensive estimates of the health effects of smoking than previously available.

    Findings: Globally in 2019, 1.14 billion (95% uncertainty interval 1.13-1.16) individuals were current smokers, who consumed 7.41 trillion (7.11-7.74) cigarette-equivalents of tobacco in 2019. Although prevalence of smoking had decreased significantly since 1990 among both males (27.5% [26.5-28.5] reduction) and females (37.7% [35.4-39.9] reduction) aged 15 years and older, population growth has led to a significant increase in the total number of smokers from 0.99 billion (0.98-1.00) in 1990. Globally in 2019, smoking tobacco use accounted for 7.69 million (7.16-8.20) deaths and 200 million (185-214) disability-adjusted life-years, and was the leading risk factor for death among males (20.2% [19.3-21.1] of male deaths). 6.68 million (86.9%) of 7.69 million deaths attributable to smoking tobacco use were among current smokers.

    Interpretation: In the absence of intervention, the annual toll of 7.69 million deaths and 200 million disability-adjusted life-years attributable to smoking will increase over the coming decades. Substantial progress in reducing the prevalence of smoking tobacco use has been observed in countries from all regions and at all stages of development, but a large implementation gap remains for tobacco control. Countries have a clear and urgent opportunity to pass strong, evidence-based policies to accelerate reductions in the prevalence of smoking and reap massive health benefits for their citizens.

    Copyright (C) 2021 The Author(s). Published by Elsevier Ltd.
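
Two of the headline figures can be checked with simple arithmetic on the point estimates quoted in this abstract: the share of smoking-attributable deaths that occurred among current smokers, and the relative growth in the number of smokers since 1990. The snippet below is only a back-of-the-envelope check on the rounded values given above. It ignores the uncertainty intervals, and the variable names are illustrative.

```python
# Point estimates quoted in the abstract (rounded as published).
deaths_total_millions = 7.69
deaths_current_smokers_millions = 6.68
smokers_1990_billions = 0.99
smokers_2019_billions = 1.14

# Share of attributable deaths that were among current smokers.
share_current = deaths_current_smokers_millions / deaths_total_millions

# Relative increase in the number of smokers despite falling prevalence.
growth = (smokers_2019_billions - smokers_1990_billions) / smokers_1990_billions

print(f"{share_current:.1%}")
print(f"{growth:.1%}")
```

The first value reproduces the 86.9% stated in the findings. The second, roughly a 15% rise, illustrates how population growth can increase the number of smokers even while prevalence falls.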

    Spatial, temporal, and demographic patterns in prevalence of chewing tobacco use in 204 countries and territories, 1990-2019 : a systematic analysis from the Global Burden of Disease Study 2019

    Background: Chewing tobacco and other types of smokeless tobacco use have had less attention from the global health community than smoked tobacco use. However, the practice is popular in many parts of the world and has been linked to several adverse health outcomes. Understanding trends in prevalence with age, over time, and by location and sex is important for policy setting and in relation to monitoring and assessing commitment to the WHO Framework Convention on Tobacco Control.

    Methods: We estimated prevalence of chewing tobacco use as part of the Global Burden of Diseases, Injuries, and Risk Factors Study 2019 using a modelling strategy that used information on multiple types of smokeless tobacco products. We generated a time series of prevalence of chewing tobacco use among individuals aged 15 years and older from 1990 to 2019 in 204 countries and territories, including age-sex specific estimates. We also compared these trends to those of smoked tobacco over the same time period.

    Findings: In 2019, 273.9 million (95% uncertainty interval 258.5 to 290.9) people aged 15 years and older used chewing tobacco, and the global age-standardised prevalence of chewing tobacco use was 4.72% (4.46 to 5.01). 228.2 million (213.6 to 244.7; 83.29% [82.15 to 84.42]) chewing tobacco users lived in the south Asia region. Prevalence among young people aged 15-19 years was over 10% in seven locations in 2019. Although global age-standardised prevalence of smoking tobacco use decreased significantly between 1990 and 2019 (annualised rate of change: -1.21% [-1.26 to -1.16]), similar progress was not observed for chewing tobacco (0.46% [0.13 to 0.79]). Among the 12 highest prevalence countries (Bangladesh, Bhutan, Cambodia, India, Madagascar, Marshall Islands, Myanmar, Nepal, Pakistan, Palau, Sri Lanka, and Yemen), only Yemen had a significant decrease in the prevalence of chewing tobacco use, which was among males between 1990 and 2019 (-0.94% [-1.72 to -0.14]), compared with nine of 12 countries that had significant decreases in the prevalence of smoking tobacco. Among females, none of these 12 countries had significant decreases in prevalence of chewing tobacco use, whereas seven of 12 countries had a significant decrease in the prevalence of smoking tobacco use for the period.

    Interpretation: Chewing tobacco remains a substantial public health problem in several regions of the world, and predominantly in south Asia. We found little change in the prevalence of chewing tobacco use between 1990 and 2019, and that control efforts have had much larger effects on the prevalence of smoking tobacco use than on chewing tobacco use in some countries. Mitigating the health effects of chewing tobacco requires stronger regulations and policies that specifically target use of chewing tobacco, especially in countries with high prevalence.

    Copyright (c) 2021 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.